Reduce memory usage by GroupReadsByUmi in a corner case #774

tfenne · 2022-02-17T23:15:54Z

No description provided.

… there are many reads with the same start/stop and edits=0.

codecov-commenter · 2022-02-17T23:58:36Z

Codecov Report

Merging #774 (771c87a) into master (9054893) will decrease coverage by 0.07%.
The diff coverage is 80.55%.

❗ Current head 771c87a differs from pull request most recent head 84ef127. Consider uploading reports for the commit 84ef127 to get more accurate results

@@            Coverage Diff             @@
##           master     #774      +/-   ##
==========================================
- Coverage   95.57%   95.50%   -0.08%     
==========================================
  Files         119      119              
  Lines        6805     6830      +25     
  Branches      476      450      -26     
==========================================
+ Hits         6504     6523      +19     
- Misses        301      307       +6

Flag	Coverage Δ
unittests	`95.50% <80.55%> (-0.08%)`	⬇️


Impacted Files	Coverage Δ
...cala/com/fulcrumgenomics/umi/GroupReadsByUmi.scala	`94.41% <78.78%> (-2.46%)`	⬇️
...in/scala/com/fulcrumgenomics/umi/CorrectUmis.scala	`98.70% <100.00%> (+0.03%)`	⬆️


tfenne · 2022-02-17T23:59:50Z

src/main/scala/com/fulcrumgenomics/umi/CorrectUmis.scala

+              logger.warning(s"Read (${rec.name}) detected with unexpected length UMI(s): ${sequences.mkString(" ")}.")
+              logger.warning(s"Expected UMI length: ${umiLength}")


Apologies for the unrelated change here. I had a stupid typo that took me far too long to figure out: I used -u instead of -U, and CorrectUmis happily decided my filename was the sole UMI sequence to correct to. If the message here had told me my expected UMI length was 30+, that would have helped!

nh13

I think there's a subtle change to the behavior which we could think of as an improvement, but it does change the behavior, so we should discuss.


nh13 · 2022-02-18T06:34:13Z

src/main/scala/com/fulcrumgenomics/umi/GroupReadsByUmi.scala

+      iterator.hasNext &&
+      firstEnds == ReadInfo(iterator.head.r1.get) &&
+      // This last condition only works because we put a canonicalized UMI into rec(assignTag) if canTakeNextGroupByUmi
+      (!canTakeNextGroupByUmi || firstUmi == iterator.head.r1.get.apply[String](this.assignTag))
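
To illustrate the condition in the diff above, here is a minimal sketch of pulling one group at a time from a sorted, buffered iterator. It is not the fgbio implementation: `Template`, `takeNextGroup`, and the `groupByUmi` parameter are hypothetical stand-ins for `ReadInfo`, the real method, and `canTakeNextGroupByUmi`. The point is only that, when the UMI can be part of the grouping key, each group held in memory is limited to templates that share both ends and UMI.

```scala
import scala.collection.BufferedIterator

object TakeNextGroupSketch {
  /** Hypothetical stand-in for a template: its end positions and its canonicalized UMI. */
  case class Template(ends: String, umi: String)

  /** Pulls the next run of consecutive templates that share ends and, optionally, UMI. */
  def takeNextGroup(iter: BufferedIterator[Template], groupByUmi: Boolean): Seq[Template] = {
    val first   = iter.next()
    val builder = Seq.newBuilder[Template]
    builder += first
    // Keep taking while the next template matches the first one on the grouping key
    while (iter.hasNext &&
           iter.head.ends == first.ends &&
           (!groupByUmi || iter.head.umi == first.umi)) {
      builder += iter.next()
    }
    builder.result()
  }
}
```

With `groupByUmi = false` every template at the same position lands in one group; with `groupByUmi = true` the same input is split into smaller groups, one per UMI.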


So this does change behavior in a subtle way. Suppose we have three templates that are the same in everything but the assign tag. Let's say AAAA, GGGGG, GGGGT, with --min-umi-length=3.

Previously, all templates would be read into memory, and truncateUmis would truncate to the length of the shortest UMI observed in the group of templates, in this case 4bp long due to the AAAA (not 3 as per the command line!). So the three templates would have UMIs AAAA->AAAA, GGGGG->GGGG, and GGGGT->GGGG. So we'd assign two unique molecules (AAAA and GGGG, with the last molecule containing the last two reads).
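
A rough sketch of that previous behavior, assuming UMIs are plain strings and edits=0 means exact matching. The object and method names here are hypothetical and are not taken from fgbio:

```scala
object OldTruncationSketch {
  /** Hypothetical helper: truncates every UMI to the length of the shortest UMI in the group. */
  def truncateToShortest(umis: Seq[String]): Seq[String] = {
    if (umis.isEmpty) umis
    else {
      val minLen = umis.map(_.length).min
      umis.map(_.take(minLen))
    }
  }

  /** With exact matching, the molecules are simply the distinct truncated UMIs. */
  def molecules(umis: Seq[String]): Seq[String] = truncateToShortest(umis).distinct
}
```

Calling `OldTruncationSketch.molecules(Seq("AAAA", "GGGGG", "GGGGT"))` gives `Seq("AAAA", "GGGG")`, i.e. two molecules.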

In the new implementation, we truncate the raw UMI bases based on --min-umi-length=3 to set MI for sorting. So we'd truncate to length 3 for sorting: AAAA->AAA, GGGGG->GGG, and GGGGT->GGG. When we read back in after sorting, we read in the first template by itself (it is the only read with MI:AAA), so no truncation is applied and it stays the same (AAAA). We then read in all templates with MI having GGG, which gives the second two templates. Now we go back to the raw UMIs to find the length of the shortest UMI of the two. Both are 5bp long, so we do not truncate and keep them the same (GGGGG and GGGGT). But now these UMIs differ, so we get three molecules!
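
The new behavior can be sketched the same way, again with hypothetical names and plain strings for UMIs. Here `groupBy` on the truncated prefix stands in for sorting on the truncated UMI stored in the assign tag, and truncation to the shortest raw UMI then happens within each subgroup only:

```scala
object NewTruncationSketch {
  /** Groups raw UMIs by their minUmiLength prefix, then truncates within each subgroup
    * to that subgroup's shortest raw UMI and keeps the distinct results. */
  def molecules(umis: Seq[String], minUmiLength: Int): Seq[String] = {
    // Stand-in for sorting and grouping on the truncated UMI
    val byKey = umis.groupBy(_.take(minUmiLength))
    byKey.toSeq.sortBy(_._1).flatMap { case (_, group) =>
      val minLen = group.map(_.length).min
      group.map(_.take(minLen)).distinct
    }
  }
}
```

Calling `NewTruncationSketch.molecules(Seq("AAAA", "GGGGG", "GGGGT"), 3)` gives `Seq("AAAA", "GGGGG", "GGGGT")`, i.e. three molecules for the same input.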

One could argue that the new implementation is an improvement, but it does change behavior subtly.

Hrm, that is an interesting point. So basically if you have variable-length single UMIs and have edits = 0, the behavior will be subtly different. My instinct is to call this an improvement and move on, but I don't have a great sense of who uses the variable-length support (or on what kind of data), so I'm not 100% sure.

Co-authored-by: Nils Homer <[email protected]>

tfenne added 2 commits February 17, 2022 16:14

Bump sbt assembly plugin version - new version seems much faster.

ada649f

A slightly hacky optimization to drastically reduce memory usage when…

4175cf6

… there are many reads with the same start/stop and edits=0.

tfenne self-assigned this Feb 17, 2022

tfenne added 2 commits February 17, 2022 16:18

Fixup changes in CorrectUmis

bcd8ff9

Take 2 at fixing the problem.

771c87a

tfenne commented Feb 17, 2022


tfenne marked this pull request as ready for review February 18, 2022 00:00

tfenne requested a review from nh13 February 18, 2022 00:00

nh13 requested changes Feb 18, 2022


tfenne and others added 2 commits February 18, 2022 05:25

Apply suggestions from code review

d503ea1

Co-authored-by: Nils Homer <[email protected]>

Extended a comment.

84ef127

tfenne merged commit 7176170 into master Feb 20, 2022

tfenne deleted the tf_group_reads_by_umi_speed_tweaks branch February 20, 2022 12:52
